{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## 使用HMM 实现词性标注\n",
    "\n",
    "目标：为一个句子，找到“最好”的词性标签。\n",
    "$$\\arg \\max\\limits_{t_1t_2...t_N} P(t_1,...,t_N|w_1,...,w_N)\\propto\\arg \\max\\limits_{t_1t_2...t_N} \\prod\\limits_{i=1}^N P(t_i|t_{i-1})P(w_i|t_{i})$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "outputs": [],
   "source": [
    "import nltk\n",
    "from nltk.corpus import brown"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "先看一下brown语料中，句子的形式，都是形如：[(I,NOUN), (Love, VERB), (You, NOUN)]\n",
    "即：(word,tag) 形式"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'list'> [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), (\"Atlanta's\", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), (\"''\", \"''\"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]\n"
     ]
    }
   ],
   "source": [
    "print(type(brown.tagged_sents()[0]), brown.tagged_sents()[0])"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 1.句子开始和结束标记\n",
    "\n",
    "为语料中每个形如[(word,tag), (word,tag),...]的句子添加(<start>, <start>) 和 (<end>, <end>)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1275872\n"
     ]
    }
   ],
   "source": [
    "start_token = '<start>'\n",
    "end_token = '<end>'\n",
    "\n",
    "brown_tags_words = []\n",
    "for sent in brown.tagged_sents():\n",
    "    brown_tags_words.append((start_token, start_token))\n",
    "    # brown_tags_words = brown_tags_words + sent # 这种写法性能极差\n",
    "    # 交换word和tag的顺序，方便后面使用nltk的ConditionalFreqDist 条件频率计数工具\n",
    "    brown_tags_words.extend([(t[:2],w) for (w,t) in sent]) # 简化词性 原始语料有470多种词性，太复杂\n",
    "    brown_tags_words.append((end_token, end_token))\n",
    "\n",
    "print(len(brown_tags_words))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 2.统计词频\n",
    "\n",
    "发射概率估计：\n",
    "$$\\hat{b}_j(k)=P(w_n=k|t_n=j)=\\frac{Count(w_n=k,t_n=j)}{Count(t_n=j)}$$\n",
    "\n",
    "为了方便，这里直接使用nltk自带的统计工具"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "outputs": [],
   "source": [
    "# Calculate conditional frequency distribution 返回FreqDist对象的集合\n",
    "# 返回的key是条件，即tag value是FreqDist对象 即词与词频\n",
    "cfd_tagwords = nltk.ConditionalFreqDist(brown_tags_words) # count(wk, tj)\n",
    "# 使用MLE（极大似然估计）计算conditional probability distribution，即用词频信息去计算概率\n",
    "# 返回的key是条件 即tag，value是p(wk|tj)\n",
    "cpd_tagwords = nltk.ConditionalProbDist(cfd_tagwords, nltk.MLEProbDist) # p(wk|tj)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "51\n",
      "dict_keys(['<start>', 'AT', 'NP', 'NN', 'JJ', 'VB', 'NR', 'IN', '``', \"''\", 'CS', 'DT', '.', '<end>', 'RB', ',', 'WD', 'HV', 'CC', 'BE', 'TO', 'PP', 'DO', 'AP', 'QL', 'AB', 'WR', 'CD', 'MD', 'PN', 'WP', '*', 'EX', ':', '(', ')', 'RP', '--', 'OD', ',-', \"'\", '(-', ')-', 'FW', 'UH', ':-', '.-', '*-', 'WQ', 'RN', 'NI'])\n"
     ]
    }
   ],
   "source": [
    "print(len(cpd_tagwords))\n",
    "print(cpd_tagwords.keys())"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The probability of an adjective (JJ) being 'new' is 0.01472344917632025\n",
      "The probability of a verb (VB) being 'duck' is 6.042713350943527e-05\n"
     ]
    }
   ],
   "source": [
    "print(\"The probability of an adjective (JJ) being 'new' is\", cpd_tagwords['JJ'].prob('new'))\n",
    "print(\"The probability of a verb (VB) being 'duck' is\", cpd_tagwords['VB'].prob('duck'))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "转移概率估计：\n",
    "$$\\hat{a}_{ij}=P(t_{n+1}=j|t_n=i)=\\frac{Count(t_n=i,t_{n+1}=j)}{Count(t_n=i)}$$\n",
    "\n",
    "初始状态概率分布：\n",
    "$$\\hat{\\pi}_i=P(t_{1}=i|t_0='start')=\\frac{Count(t_0='start',t_{1}=i)}{Count(t_0='start')}$$"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "outputs": [],
   "source": [
    "brown_tags = [t for (t,w) in brown_tags_words] # 取出所有tag\n",
    "# Calculate conditional frequency distribution 返回FreqDist对象的集合\n",
    "# 返回的key是条件，即tag value是FreqDist对象 即tag与频率 （bigram tag）\n",
    "cfd_tags = nltk.ConditionalFreqDist(nltk.bigrams(brown_tags)) # count(t{i-1}, ti)\n",
    "# print(cfd_tags.keys())\n",
    "# print(cfd_tags['<start>']['NN'])\n",
    "# 返回的key是条件 即tag，value是p(ti|t{i-1})\n",
    "cpd_tags = nltk.ConditionalProbDist(cfd_tags, nltk.MLEProbDist) # p(ti|t{i-1})\n",
    "# print(cpd_tags.keys())\n",
    "# print(cpd_tags['<start>'].prob('NN'))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "If we have just seen 'DT', the probability of 'NN' is 0.5057722522030194\n",
      "If we have just seen 'VB', the probability of 'JJ' is 0.016885067592065053\n",
      "If we have just seen 'VB', the probability of 'NN' is 0.10970977711020183\n"
     ]
    }
   ],
   "source": [
    "print(\"If we have just seen 'DT', the probability of 'NN' is\", cpd_tags[\"DT\"].prob(\"NN\"))\n",
    "print( \"If we have just seen 'VB', the probability of 'JJ' is\", cpd_tags[\"VB\"].prob(\"DT\"))\n",
    "print( \"If we have just seen 'VB', the probability of 'NN' is\", cpd_tags[\"VB\"].prob(\"NN\"))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "给出一个句子：‘I want to race’ ，它与tag 序列：'PP VB TO VB' 匹配度有多高呢？\n",
    "\n",
    "其实就是求：$P(w_1,w_2,...,w_N, t_1,t_2,...,t_N)\\approx P(t_0)\\prod\\limits_{i=1}^N P(t_i|t_{i-1})P(w_i|t_{i})$\n",
    "\n",
    "所以匹配度即概率为：\n",
    "$$\\begin{aligned}P(start)*&P(PP|start)*P(I|PP)*\\\\&P(VB|PP)*P(want|VB)*\\\\&P(TO|VB)*P(to|TO)*\\\\&P(VB|TO)*P(race|VB)*\\\\&P(end|VB)\\end{aligned}$$\n",
    "\n",
    "【注意】：p(start)=1"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The probability of the tag sequence '<start> PP VB TO VB <end>' for 'I want to race' is: 0.0\n"
     ]
    }
   ],
   "source": [
    "prob_of_tagsentences = cpd_tags['<start>'].prob('PP') * cpd_tagwords['PP'].prob('I') * \\\n",
    "    cpd_tags['VB'].prob('PP') * cpd_tagwords['VB'].prob('want') * \\\n",
    "    cpd_tags['TO'].prob('VB') * cpd_tagwords['TO'].prob('to') * \\\n",
    "    cpd_tags['VB'].prob('TO') * cpd_tagwords['VB'].prob('race') * \\\n",
    "    cpd_tags['<end>'].prob('VB')\n",
    "print(\"The probability of the tag sequence '<start> PP VB TO VB <end>' for 'I want to race' is:\", prob_of_tagsentences)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 3.vitervi算法实现\n",
    "\n",
    "解决问题3：已知一个句子，求解最符合的tags序列\n",
    "\n",
    "（1）初始化\n",
    "（2）递推\n",
    "（3）终止\n",
    "（4）最优路径回溯"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'WR': 0.0, 'NP': 1.7319067623793952e-06, 'CD': 0.0, ':-': 0.0, '(-': 0.0, ')': 0.0, 'AB': 0.0, ',': 0.0, '.-': 0.0, '*-': 0.0, 'OD': 0.0, 'DO': 0.0, '*': 0.0, 'CC': 0.0, 'FW': 0.0, 'WQ': 0.0, 'IN': 0.0, 'MD': 0.0, 'RP': 0.0, 'QL': 0.0, 'EX': 0.0, \"'\": 0.0, '.': 0.0, 'RN': 0.0, 'RB': 0.0, \"''\": 0.0, 'WP': 0.0, 'DT': 0.0, 'BE': 0.0, 'AP': 0.0, 'PN': 0.0, 'WD': 0.0, '``': 0.0, 'TO': 0.0, 'HV': 0.0, 'CS': 0.0, 'VB': 0.0, '(': 0.0, ')-': 0.0, '--': 0.0, ',-': 0.0, 'UH': 0.0, 'PP': 0.014930900689060006, 'AT': 0.0, ':': 0.0, 'NN': 1.0580313619573935e-06, 'NR': 0.0, 'NI': 3.3324520848931064e-07, 'JJ': 0.0, '<end>': 0.0}\n",
      "{'WR': '<start>', 'NP': '<start>', 'CD': '<start>', ':-': '<start>', '(-': '<start>', ')': '<start>', 'AB': '<start>', ',': '<start>', '.-': '<start>', '*-': '<start>', 'OD': '<start>', 'DO': '<start>', '*': '<start>', 'CC': '<start>', 'FW': '<start>', 'WQ': '<start>', 'IN': '<start>', 'MD': '<start>', 'RP': '<start>', 'QL': '<start>', 'EX': '<start>', \"'\": '<start>', '.': '<start>', 'RN': '<start>', 'RB': '<start>', \"''\": '<start>', 'WP': '<start>', 'DT': '<start>', 'BE': '<start>', 'AP': '<start>', 'PN': '<start>', 'WD': '<start>', '``': '<start>', 'TO': '<start>', 'HV': '<start>', 'CS': '<start>', 'VB': '<start>', '(': '<start>', ')-': '<start>', '--': '<start>', ',-': '<start>', 'UH': '<start>', 'PP': '<start>', 'AT': '<start>', ':': '<start>', 'NN': '<start>', 'NR': '<start>', 'NI': '<start>', 'JJ': '<start>', '<end>': '<start>'}\n"
     ]
    }
   ],
   "source": [
    "# 先拿出所有的tags\n",
    "distinct_tags = set(brown_tags)\n",
    "\n",
    "sentence = [\"I\", \"want\", \"to\", \"race\"]\n",
    "\n",
    "sent_len = len(sentence)\n",
    "\n",
    "score = [] # best path score\n",
    "backpointer = [] # 记录best path 节点的前一个节点\n",
    "\n",
    "# （1）初始化\n",
    "first_score = {}\n",
    "first_backpointer = {}\n",
    "# 初始化score 和 backpointer\n",
    "for tag in distinct_tags:\n",
    "    if tag == '<start>': # 没有以start结尾的句子\n",
    "        continue\n",
    "    # score[t1,w1]=p(start) * p(t1|start) * (w1|t1)=1*p(t1|start) * (w1|t1)\n",
    "    # 或者\\delta_{1}(i)=\\pi_{i} b_{i}(o_{1})\n",
    "    p_t1_c_start = cpd_tags['<start>'].prob(tag) * cpd_tagwords[tag].prob(sentence[0])\n",
    "    first_score[tag] = p_t1_c_start\n",
    "    first_backpointer[tag] = '<start>'\n",
    "\n",
    "# 得到第一个“最优”子路径和第一个回溯点\n",
    "print(first_score)\n",
    "print(first_backpointer)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Word 'I' current best two-tag sequence: <start> PP\n"
     ]
    }
   ],
   "source": [
    "# 看看目前最好的tag是哪一个？\n",
    "curr_best = max(first_score.keys(), key=lambda tag: first_score[tag])\n",
    "print( \"Word\", \"'\" + sentence[0] + \"'\", \"current best two-tag sequence:\", first_backpointer[curr_best], curr_best)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "outputs": [],
   "source": [
    "# 将第一个节点加入score和backpointer\n",
    "score.append(first_score)\n",
    "backpointer.append(first_backpointer)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Word 'want' current best two-tag sequence: PP VB\n",
      "Word 'to' current best two-tag sequence: VB TO\n",
      "Word 'race' current best two-tag sequence: IN NN\n"
     ]
    }
   ],
   "source": [
    "# （2）递推\n",
    "# 完成初始化了，就可以开始inductive 递推了\n",
    "for i in range(1, sent_len):\n",
    "    this_score = {}\n",
    "    this_backpointer = {}\n",
    "    prev_score = score[-1] # 所有的前驱节点的score score表示了到该点截至，目前最优路径的score\n",
    "\n",
    "    for tag in distinct_tags:\n",
    "        if tag == '<start>':\n",
    "            continue\n",
    "        best_prevtag = max(prev_score.keys(), key=lambda prevtag: prev_score[prevtag] * \\\n",
    "                                                               cpd_tags[prevtag].prob(tag) * \\\n",
    "                                                               cpd_tagwords[tag].prob(sentence[i]))\n",
    "        this_score[tag] = prev_score[best_prevtag] * cpd_tags[best_prevtag].prob(tag) * \\\n",
    "                          cpd_tagwords[tag].prob(sentence[i])\n",
    "        this_backpointer[tag] = best_prevtag\n",
    "\n",
    "    # 我们把当前最好的tag打印一下\n",
    "    curr_best = max(this_score.keys(), key = lambda tag: this_score[tag])\n",
    "    print( \"Word\", \"'\" + sentence[i] + \"'\", \"current best two-tag sequence:\", this_backpointer[curr_best], curr_best)\n",
    "\n",
    "    # 将当前节点加入score和backpointer\n",
    "    score.append(this_score)\n",
    "    backpointer.append(this_backpointer)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "best_score: 5.71772824864617e-14\n"
     ]
    }
   ],
   "source": [
    "# （3）终止\n",
    "# 找所有以<end>结尾tag_sequence， score中最大值\n",
    "prev_score = score[-1]\n",
    "best_prevtag = max(prev_score.keys(), key=lambda prevtag: prev_score[prevtag] * \\\n",
    "                                                               cpd_tags[prevtag].prob('<end>')\n",
    "                   )\n",
    "best_score = prev_score[best_prevtag] * cpd_tags[best_prevtag].prob('<end>')\n",
    "print('best_score:', best_score)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The sentence was: ['I', 'want', 'to', 'race']\n",
      "The best tag sequence is: ['<start>', 'PP', 'VB', 'IN', 'NN', '<end>']\n"
     ]
    }
   ],
   "source": [
    "# （4）最优路径回溯\n",
    "best_tag_seqs = ['<end>', best_prevtag] # 准备开始\n",
    "curr_best_tag = best_prevtag\n",
    "for bp in backpointer[::-1]: # 从后往前tracking\n",
    "    best_tag_seqs.append(bp[curr_best_tag])\n",
    "    curr_best_tag = bp[curr_best_tag]\n",
    "\n",
    "best_tag_seqs.reverse()\n",
    "\n",
    "print('The sentence was:', sentence)\n",
    "print('The best tag sequence is:', best_tag_seqs)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The probability of the best tag sequence is: 5.71772824864617e-14\n"
     ]
    }
   ],
   "source": [
    "print( \"The probability of the best tag sequence is:\", best_score)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "结果不是很好，说明要加更多的语料"
   ],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}